For the explore and summarise data project, the propserLoanData data set is what I chose. The data set has 81 columns & 113937 observations.
#converting the ListingCategory to factor variable & associating them
#for labeling purposes
prosperLoanData$ListingCategory..numeric. <-
factor(prosperLoanData$ListingCategory..numeric.)
cats <- c('0 - Not Available', '1 - Debt Consolidation',
'2 - Home Improvement', '3 - Business', '4 - Personal Loan',
'5 - Student Use', '6 - Auto', '7- Other', '8 - Baby&Adoption',
'9 - Boat', '10 - Cosmetic Procedure', '11 - Engagement Ring',
'12 - Green Loans', '13 - Household Expenses',
'14 - Large Purchases', '15 - Medical/Dental', '16 - Motorcycle',
'17 - RV', '18 - Taxes', '19 - Vacation', '20 - Wedding Loans')
#converting Term into a factor variable
prosperLoanData$Term <- factor(prosperLoanData$Term)
Loan Original Amount has a positive skew. The mean value for the LoanOriginalAmount is 8337 which is greater than the median. The median for the LoanOriginalAmount is 6500.
Monthly Loan Payment has a positive skew. The median is less than the mean for this data. The median for the monthly Loan Payment is 217 and the mean is 272.
credit score range lower has a normal distribution. The mean/median in this case is 685.56/680
Borrower Rate also has a normal distribution and the mean/med ratio is 0.19/0.18 However, there are various spikes especially between 0.24 and 0.31.
Debt To Income Ratio exhibits a slight positive skew with mean > median. Mean is 0.275 and median is 0.22. DI ratio is less than 0.5 for a good percentage of pepole.
Stated Monthly Income has a positive skew with mean > median. Mean for the Stated Monthly Income is 5608 and the median is 4666.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 637.5 1061.5 1224.9 1622.4 171004.2 8554
MonthlyDebtTotal has been added as a new column. This was derived by multiplying the debt to Income Ratio to the stated Monthly Income.
Total Monthly debt has a positive skew on its histogram and the mean is greater than the median. The mean is 1224. The median is 1061. The first quartile for this data is 637 and the third quartile is 1622.
The Employment Status Duration exhibits a positive skew. The mean of the data is higher than the median. The mean is 96 and the median is 67. It is observed that more number of employees worked for shorter durations.
Current Credit Lines has a slight positive skew. The mean/median ratio is 10.31/10.
Loans originate mostly during October, December and January
During the year 2009, there was a drop. This can be attributed to the economic crisis during that year.
## [1] "Loans by Status"
##
## Cancelled Chargedoff Completed
## 5 11992 38074
## Current Defaulted FinalPaymentInProgress
## 56576 5018 205
## Past Due (>120 days) Past Due (1-15 days) Past Due (16-30 days)
## 16 806 265
## Past Due (31-60 days) Past Due (61-90 days) Past Due (91-120 days)
## 363 313 304
Most of the loans are either in the completed status or in current status.
## [1] "# of Loans by income Range"
## $0 $1-24,999 $100,000+ $25,000-49,999 $50,000-74,999
## 621 7274 17337 32192 31050
## $75,000-99,999 Not displayed Not employed
## 16916 7741 806
Most of the loans are taken by people with income range between 25k USD to 50k USD and 50k to 75k USD.
## [1] "# of Loans by term"
##
## 12 36 60
## 1614 87778 24545
Most of the loans are of duration 36 months (3 years)
Top 10 states by the number of borrowers were plotted and this the state California topped the list.
Since the most of the loans belonged to the category of “Other”, the data is not fully sufficient to analyse people with what kind of occupation go for the highest loans.
## [1] "Borrowers - Are they home owners?"
## False True
## 56459 57478
Almost 50% of the borrowers are home owners
Employment Status data doesnt seem to be complete. Also, as anticipated, the employed categroy go for the highest number of loans.
The data set has around 113937 observations and around 81 columns. For our analysis a combination of discrete and continuous variables are being used. All of the variables are not being used. We are using around 20 variables here.
The data set is huge and the more we explore, the more can we get deeper into the data. For now, the focus is on some of the below areas, which was found interesting and intriguing: credit score, debt to income ratio, income, loan amount, interest rate, term etc
Loan Category, Loan Amount, Loan Status etc are available and these will help analyse the data better
LoanYear, LoanMonth etc were derived from the LoanOriginationDate. Also, a new column MonthlyDebtTotal was added by multiplying DI ratio with the stated Monthly Income
Term and ListingCategory..numeric. were converted to factor variables. They were initially categorical values. It was done while making a bar plot which expected that these values are discrete.
## $title
## [1] "Current Credit Lines Vs Total Monthly Debt"
##
## $subtitle
## NULL
##
## attr(,"class")
## [1] "labels"
There is a clear positive correlation between Current CreditLines and the Total Monthly Debt and the value of the correlation coeffient is 0.47.
There is a clear negative correlation betwen the CreditScores and the borrower rate. The correlation coefficient is -0.46. Persons with a lower credit scores get the loans at a lower borrower rate.
There is a positive correlation between the stated Monthly Income and the total monthly debt. The correlation coefficient is 0.36.
The correlation between the loan Amount and Borrower Rate is negative. The value of the correlation coeffient is -0.32. This would mean that the higher loans are disbursed at a lower interest rate.
There is a slight positive correlation between loan Amount and Credit Scores. Higher the loans, the credit scores can increase.
Monthly Income Vs Total Monthly Debt was plotted and was facet wrapped by IsBorrowerHomeowner.The dispersion is higher in the case of home owners and for the non home owners, the dispersion is concentrated more amoung the borrowers with an income of around 5000.
The median amount dips in 2009 and has a sharp rise in 2013.
Dec to Feb is the period when the loan amounts are usually higher.
The higher amount loans are taken for Debt consolidation and baby and adoption
Borrowers with large incomes take large loan.
Employed and Self employed are the ones who are taking higher loans
Higher Loans Amounts are repaid thorough longer terms
Higher the term of repayment does not always means that the interest rates are higher.
Term and Credit Scores - Not much insights can be drawn from this plot.
Total Monthly Debt and Current Credit Lines have a strong positive correlation Longer the term of the loan, it is found that the loan amount is also larger Employed and self employed ones borrow larger amounts
Could not do this analysis. The credit scores and the salary ranges, credit scores and employment status are some of the relationships I wanted to analyse.
CurrentCreditLines and MonthlyDebtTotal had a strong +ve correlation with a correlation coefficient of 0.47. Also, credit scores and borrower rate have a strong positive correlation with a correlation coefficient of 0.46
The top left ones are the ones which takes loans for a smaller amount and at higher rates. These are the borrowers with lower credit scores. The borrowers with higher credit scores (red colored) gets higher amount of loans at lower rates. (Only values corresponding to the first quartile are plotted for better visualisation purposes)
MonthlyLoanPayment was divided by the StatedMonthlyIncome to observe the yearwise debt coverage ratio. This was faceted by IncomeRange. All the income ranges have an uptick in the most recent years.
Most of the borrowers seem to have DI ratio of 0.2 to 0.3 and is showing an uptick in the most recent years. Borrowers with income range $100k + have a low DI ratio and is around 0.15-0.20
(For better visualisation, only credit scores from the first quartile are plotted here)
The above plot shows how the lending platform Prosper has matured over years. Red shows higher credit scores. We can see that by the year 2014, blue dominates,which would also mean that the over all risk exposure has increased. That is loans are given out to borrowers with lower credit scores at a higher rate. If we take a look at the chart for 2012, we can see that those days, higher amount of loans were released mainly for borrowers with high credit scores.
Again, this plot says that as the prosper lending platform matures, probably the risk management also has improved. In the earlier years red (high debt to income ratio) is almost not seen in the charts. In the later years, we can observe that loans are disbursed to borrowers with high debt to income ratio as well & hence dominance of the red color in the charts as the years are passing by.
Only deliquent loans were subsetted here. The majority of the delinquencies are clustered in 1-15days past due. Maximum delinqueny is happening for the borrowers having income ranges between $25k-50k and $50-75k.
Borrowers with lower credit scores takes loans for a smaller amount & the loans are at disbursed to them at a higher interest rate
Also, it was observed that borrowers with the highest income have a lower percentage of debt.
Taking a look at the DI ratio, it was observed that higher the income, the lower is the percentage of debt.
The color red represents the borrowers with a high credit scores. They opt for higher amount of loans and the rates at which the loans are taken are lower for them as compared to the ones with lower credit scores.
The above plot shows how the lending platform Prosper has matured over years.
This chart digs deeper into plot one and splits the data across years.
Red shows higher credit scores and we can see that by the year 2014, blue dominates higher loan amount section (> 20000 USD),which would also mean that the over all risk exposure has increased and the lending platform is now giving out higher amout of loans to borrowers with lower credit scores as well & to cover the risk probably,they are being charged a higher interest rate.
(For better visualisation, only credit scores from the first quartile are plotted here)
This plot reinforces the above finding that the risk management strategies has been worked upon at Prosper as the platform matured. This could be the reason that unlike the earlier years, a good percentage of laons are disbursed for borrowers with high debt to income ratio.
The data set has almost 114,000 loans.
The most time consuming part of this analysis was to find some variables which will help make some good analysis & to list down what are the analysis that has to be performed.
There was too much of information that, it was not difficult to atleast perform the single variable and bi variable analysis. There were too many choices. However, drilling down into multi variable analysis from the results of the bi variable analysis was again difficult.
Making the choice for plotting three plots as the outcome of the entire analysis was again difficult since there was too much of data to be chosen from.
Regression models & some predictions are what I would have tried working on if time permitted. For example, preclosure, delinquency, late payment etc if it can be predicted, it would have been good.
One of the good decisions was to restrict the analysis to limited number of variables and not look into the entire data to perform the analysis. This helped finish the analysis comparitively faster.